fix: batch pending parent root fetches to avoid 300+ sequential round-trips #694
Closed
ch4r10t33r wants to merge 7 commits into main from
Conversation
When the SSZ state grows beyond ~3 MB the server switches from sending a Content-Length response to `Transfer-Encoding: chunked`. The previous body-reading loop called `readSliceShort`, which internally goes through `readSliceShort` → `readVec` → `defaultReadVec` → `contentLengthStream`. `contentLengthStream` accesses `reader.state.body_remaining_content_length`, but that field is not active for chunked responses (the state is `ready`), causing a panic:

thread 1 panic: access of union field 'body_remaining_content_length' while field 'ready' is active

Replace the manual request/response loop with `client.fetch()` using a `std.Io.Writer.Allocating` as the `response_writer`. `fetch()` calls `response.readerDecompressing()` + `streamRemaining()`, which dispatches through `chunkedStream` or `contentLengthStream` correctly based on the actual transfer encoding used by the server.
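A minimal sketch of the replacement fetch path, assuming Zig 0.15's `std.http.Client.fetch` / `std.Io.Writer.Allocating` APIs; the helper name and error handling here are illustrative, not the PR's actual code:

```zig
const std = @import("std");

/// Fetch a response body of arbitrary size. Because fetch() streams the body
/// via readerDecompressing() + streamRemaining(), it dispatches through
/// chunkedStream or contentLengthStream based on the server's actual
/// transfer encoding, so chunked (>~3 MB) bodies no longer panic.
fn fetchBody(allocator: std.mem.Allocator, client: *std.http.Client, url: []const u8) ![]u8 {
    var body: std.Io.Writer.Allocating = .init(allocator);
    defer body.deinit();

    const result = try client.fetch(.{
        .location = .{ .url = url },
        .response_writer = &body.writer,
    });
    if (result.status != .ok) return error.UnexpectedStatus;

    return body.toOwnedSlice();
}
```

The key difference from the old loop is that no code outside `fetch()` ever touches `reader.state`, so the union-field mismatch cannot occur.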
After checkpoint sync, incoming blocks may carry attestations whose target slots predate the finalized anchor. IsJustifiableSlot correctly identifies these as non-justifiable, but the `try` was propagating the error fatally, causing the entire block import to fail. This creates a cascading gap: block N fails → blocks N+1..M fail (missing parent) → no epoch-boundary attestations accumulate → justified checkpoint never advances → forkchoice stays stuck in `initing` indefinitely. Fix: catch InvalidJustifiableSlot and treat it as `false`. The attestation is then silently skipped via the existing !is_target_justifiable check, exactly as all other non-viable attestations (unknown source/target/head, stale slot, etc.) are handled. The block imports successfully, the chain catches up, and the node exits the initing state. Update the test that was asserting the old (buggy) error-propagation behaviour to instead assert that process_attestations succeeds.
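The catch-and-skip fix might look like this sketch; identifier names (`isJustifiableSlot`, `anchor_slot`, `target`, the surrounding loop) are inferred from the description rather than copied from the code:

```zig
// Before (buggy): `try` propagated InvalidJustifiableSlot and aborted the
// entire block import, stranding every descendant block as well.
//
//     const is_target_justifiable = try isJustifiableSlot(anchor_slot, target.slot);

// After: a target slot that predates the finalized anchor is simply not
// justifiable. Map the error to `false` so the attestation falls through to
// the existing skip path, like any other non-viable attestation.
const is_target_justifiable = isJustifiableSlot(anchor_slot, target.slot) catch |err| switch (err) {
    error.InvalidJustifiableSlot => false,
    else => return err,
};
if (!is_target_justifiable) continue; // silently skip; block import proceeds
```

Only the one expected error is swallowed; anything else still propagates, so genuine failures are not masked.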
After checkpoint sync the forkchoice starts in the initing state and waits for a first justified checkpoint before declaring itself ready. The status-response sync handler was checking getSyncStatus() and treating fc_initing the same as synced — doing nothing. This created a deadlock: the node never requested blocks from ahead peers because it was in fc_initing, and it could never leave fc_initing because no blocks were imported. Fix the deadlock in two places:

1. Status-response handler: add an explicit fc_initing branch that requests the peer's head block when the peer is ahead of our anchor slot. This mirrors the behind_peers branch but uses head_slot for the comparison (finalized_slot is not yet meaningful in fc_initing).
2. Periodic sync refresh: every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8) slots, re-send our status to all connected peers when not synced. This recovers from the case where all peers were already connected before the fix was deployed, so no new connection event fires and the status-response handler would never be re-triggered.
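A rough sketch of the status-response branch (point 1), with names such as `peer_status`, `anchor_slot`, and `requestBlocksByRoot` assumed from the description:

```zig
// Status-response handler: decide whether to pull blocks from this peer.
switch (self.getSyncStatus()) {
    .fc_initing => {
        // finalized_slot is not yet meaningful in fc_initing, so compare the
        // peer's head against our anchor slot instead (mirrors behind_peers).
        if (peer_status.head_slot > self.anchor_slot) {
            try self.requestBlocksByRoot(peer_id, &.{peer_status.head_root});
        }
    },
    .behind_peers => {
        // existing branch: compare against our finalized slot and request range
    },
    .synced => {}, // previously fc_initing also landed here and did nothing
}
```

The bug was precisely that `fc_initing` shared the `synced` arm, so the node sat idle with ahead peers already connected.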
When a block arrives with a missing parent, the old code immediately sent an individual blocks_by_root request for that single parent root. A syncing peer walking a long parent chain (e.g. 300 slots back) would therefore open 300+ separate libp2p streams - one per ancestor - flooding both sides with individual round-trips. Replace the immediate fire-and-forget with a deferred queue:

- Add `pending_parent_roots: AutoHashMap(Root, depth)` to BeamNode.
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of calling `fetchBlockByRoots` directly.
- `flushPendingParentFetches` drains the map and issues a single batched blocks_by_root request for all accumulated roots.
- The flush is called at every natural exit point: after the missing-parent early-return in `onGossip`, at the end of `handleGossipProcessingResult`, and at the end of `processBlockByRootChunk`.

When multiple gossip blocks arrive in the same burst with the same missing ancestor, all their parent roots are now collected and sent as one request instead of N separate requests.
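The queue-and-flush shape described above could be sketched as follows; field and function names come from the bullets, while `Root`, `SignedBlock`, and the allocator wiring are assumptions for illustration:

```zig
const std = @import("std");

pub const BeamNode = struct {
    allocator: std.mem.Allocator,
    // root -> depth at which the missing ancestor was discovered.
    // The hash map also deduplicates repeated ancestors across a gossip burst.
    pending_parent_roots: std.AutoHashMap(Root, usize),

    fn cacheBlockAndFetchParent(self: *BeamNode, block: SignedBlock, depth: usize) !void {
        try self.cacheBlock(block);
        // Enqueue instead of firing an individual blocks_by_root request.
        try self.pending_parent_roots.put(block.message.parent_root, depth);
    }

    /// Called at every natural exit point (onGossip missing-parent return,
    /// handleGossipProcessingResult, processBlockByRootChunk).
    fn flushPendingParentFetches(self: *BeamNode) !void {
        const n = self.pending_parent_roots.count();
        if (n == 0) return;

        const roots = try self.allocator.alloc(Root, n);
        defer self.allocator.free(roots);

        var i: usize = 0;
        var it = self.pending_parent_roots.keyIterator();
        while (it.next()) |root| : (i += 1) roots[i] = root.*;
        self.pending_parent_roots.clearRetainingCapacity();

        try self.fetchBlockByRoots(roots); // one batched request for all roots
    }
};
```

Draining into a single slice before the network call keeps the request count at one per burst regardless of how many blocks shared a missing ancestor.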
Author
Closing in favour of a clean PR with only the relevant commits.
Summary
- Add a `pending_parent_roots` queue to `BeamNode`
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of immediately firing an individual `blocks_by_root` request
- `flushPendingParentFetches` drains the map and sends one batched request for all accumulated roots
- The flush is invoked at the natural exit points (`handleGossipProcessingResult`, `processBlockByRootChunk`)

Problem
When a syncing peer walks a long parent chain (e.g. 300 blocks), every missing parent triggered its own `blocks_by_root` request on a fresh libp2p stream. This flooded both sides with 300+ individual round-trips instead of a single batched one.

Effect
When multiple gossip blocks arrive in the same burst sharing the same missing ancestor, all their parent roots are collected into one map and sent as a single request. For the sequential case (one block reveals the next missing parent), the deduplication in the map also prevents duplicate fetches for the same root.
Test plan
`zig build test --summary all` passes